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Abstract 

Many real world network problems often concern multivariate nodal attributes such as image, tex- 
tual, and multi-view feature vectors on nodes, rather than simple univariate nodal attributes. The 
existing graph estimation methods built on Gaussian graphical models and covariance selection 
algorithms can not handle such data, neither can the theories developed around such methods 
be directly applied. In this paper, we propose a new principled framework for estimating multi- 
attribute networks. Instead of estimating the partial correlation as in current literature, our method 
estimates the partial canonical correlations that naturally accommodate complex nodal features. 
Computationally, we provide an efficient algorithm which utilizes the multi-attribute structure. 
Theoretically, we provide sufficient conditions which guarantee consistent graph recovery. Empiri- 
cally, we apply our method on a genomic dataset to illustrate its usefulness. 

Keywords: Graphical model selection, high-dimensional analysis, multi-attribute data, network 
analysis 
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1. INTRODUCTION 

In many modern problems, we are interested in studying a network of entities with complex at- 
tributes rather than a simple univariate attribute. For example, when an entity represents a person 
as in a social network, it is widely accepted that the nodal attribute is most naturally a vector 
with many personal information including demographics, interests, and other features, rather than 
merely a single attribute, such as a binary vote as assumed in current literature of social graph 
estimation based on a Markov random fields (MRF) (Banerjee, Ghaoui and d'Aspremont 2008; Ko- 
lar, Song, Ahmed and Xing 2010). In another example, when an entity represents a gene as in a 
gene regulation network, modern technologies allow researchers to measure the activities of a single 
gene in a high-dimensional space, such as an image of spatial distribution of the gene expression, 
or a multi-view snapshot of the gene activity such as mRNA and protein abundances, rather than 
merely a single attribute such as an expression level, which is assumed in current literature on gene 
graph estimation based on a Gaussian graphical model (GGM) (Peng, Wang, Zhou and Zhu 2009). 
Indeed, it is somewhat surprising that existing research on graph estimation remains largely blinded 
to analysis of multi-attribute data that are prevalent and widely studied in the network commu- 
nity, even though existing algorithms and theoretical analysis based mainly on covariance selection 
using graphical lasso, or penalized pseudo-likelihood, do not apply to graphs with multi-variate 
nodal attributes. 

In this paper, we present a study on graph estimation from multi-attribute data, in an attempt 
to fill the gap between the practical needs and what existing methodologies offer as mentioned 
above. Recall that in a GGM, one assumes that a sample from the entire graphical model is a 
p-dimensional random vector X = {Xi, . . . , Xp)' of which each dimension corresponds to a node, 
and X ~ A/'(/i, S). Based on n i.i.d. observations, one can estimate an undirected graph G = 
(y,E), where the node set V corresponds to the p variables in X, and the edge set E describes 
the conditional independence relationships among Xi, . . . ,Xp, i.e., Xa is independent of X^ given 
X\|a {,} for all (a, b) ^ E, where X^j^ {,} represents all the variables in X except Xa and Xb. Given 
multi-attributes data, this approach is clearly invalid, because it naively translates to estimating 
one graph per attribute; a subsequent integration of all such graphs to a summary graph on the 
entire dataset would lead to unclear statistical interpretation. 
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We consider the following new setting for estimating a multi-attribute graph. Assume now a 
"stacked" long random vector X = (X'^, X^)' where Xi € ]R^'^,...,Xp € M.^p are themselves 
random vectors that jointly follow the multivariate Normal distribution, 
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Without loss of genera 
the vertex set V = [pjl 



ity, we assume /^i = 0, . . . , /ip = 0. Let G = {V, E) be a graph with 
and the set of edges E ^ V x V that encodes conditional independence 
relationships among (Xa)agy. That is, each node a ^ V oi the graph G corresponds to the 
random vector X^ and there is no edge between nodes a and h in the graph if and only if X(j 
is conditionally independent of X^ given all the vectors corresponding to the remaining nodes. 
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{Xc : c G [p]\{a, 6}}. Conditional independence can be read from the inverse of the 



covariance matrix, as the block corresponding to Xq and X^ will be equal to zero. Let = {xj} .^j^j 
be a sample of n i.i.d. vectors drawn from A/'(0,X1). For a vector Xj, we denote Xj^^ G M'^'" the 
component corresponding to the node a £V. Our goal is to estimate the structure of the graph G 
from the sample P„. Note that we allow for different nodes to have different number of attributes, 
which may be useful in certain applications, e.g., when a node represents a gene pathway in a 
regulatory network. 

Using the standard Gaussian graphical model for univariate nodal observations, one can estimate 
a graph for each attribute individually, by estimating the sparsity pattern of the precision matrix 
ft = of the GMM. This is also known as covariance selection (Dempster 1972). For high 
dimensional problems, Meinshausen and Biihlmann (2006a) propose a parallel Lasso approach for 
estimating Gaussian graphical models by solving a collection of sparse regression problems. This 
procedure can be viewed as a pseudo likelihood based method. In contrast, Banerjee et al. (2008), 
Yuan and Lin (2007), and Friedman, Hastie and Tibshirani (2008) take a penalized likelihood 
approach to estimate the sparse precision matrix Q. To reduce estimation bias. Lam and Fan 



^We use [p] to represent the set {!,... ,p}. 
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(2009), Jalali, Johnson and Ravikumar (2012), and Shen, Pan and Zhu (2012) developed the non- 
concave penalties to penalize the Ukehhood function. More recently, Yuan (2010) and Cai, Liu 
and Luo (2011) proposed the graphical Dantzig selector and CLIME, which can be solved by 
linear programming and have better theoretical properties than the penalized likelihood approach. 
Under certain regularity conditions, these methods have proven to be graph estimation consistent 
(Ravikumar, Wainwright, Raskutti and Yu 2011; Zou 2006; Zhao and Yu 2006; Wainwright 2009). 
Scalable software packages such as glasso and huge were developed to implement these algorithms 
(Zhao, Liu, Roeder, Lafferty and Wasserman 2012). However, in the case of multi-attribute data, it 
is not clear how to combine estimated networks to obtain a single network reflecting the structure 
of the underlying complex system. This is especially the case when nodes in the graph contain 
different number of attributes. 

Unlike the standard procedures for learning the structure of GGMs (e.g., neighborhood selec- 
tion (Meinshausen and Brihlmann 20066) or glasso (Friedman et al. 2008)), which infer the partial 
correlations between pairs of nodes, our proposed method estimates the partial canonical corre- 
lations between pairs of nodes, which leads to a graph estimator over multi-attribute nodes that 
bears the same probabilistic independence interpretations as that of the graph from GGM over 
univariate nodes. Under this new framework, the contributions of this paper include: (i) computa- 
tionally, an efficient algorithm is provided to estimate the multi-attribue graphs; (ii) theoretically, 
we provide sufficient conditions which guarantee consistent graph recovery; and (iii) empirically, 
we apply and compare different methods on a genomic dataset to illustrate the usefulness of our 
method. The rest of this paper is organized as follows. In ^ we provide the general modeling 
framework and illustrate the relationship between the Gaussian multi-attribute graphical models 
with partial canonical correlations. In fj3l a penalized maximum likelihood estimator is proposed 
and an efficient algorithm is provided to solve it. In ^ we study the theoretical properties of the 
proposed estimator. In f|6l we provide numerical simulations, which demonstrate tightness of our 
theoretical results, while in ^ we report results on a genomic data set. Possible extensions are 
discussed in ^ Proofs are deferred to the Appendix. 
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2. PRELIMINARIES AND RELATED WORK 
We begin with a brief description of the canonical correlation originally introduced by Hotelling 
(1936). Then we present partial canonical correlation (Rao 1969) that will be used for inferring 
links between different nodes. We conclude the section by discussing related work. 

Canonical correlation, a classical tool in multivariate statistics, is defined between two multi- 
variate random variables as 

Pc(Xa,X6)= max Corr(u'Xa, v'Xfc), 

that is, computing canonical correlation between X^ and X5 is equivalent to maximization of 
correlation between two linear combinations u'Xq and v'X;, with respect to vectors u and v. 
Canonical correlation can be used to measure association strength between two nodes with multi- 
attribute observations. For example, in Katenka and Kolaczyk (2011), a graph is estimated from 
multi-attribute nodal observations by thresholding the canonical correlation between nodes, but 
such a graph estimator may confound the direct interactions with indirect ones, as we describe 
later. 

In this work, we will use the partial canonical correlation to estimate a network from multi- 
attribute nodal observations. A network is going to be formed by connecting nodes with non-zero 
partial canonical correlation. Let A = argmin E[| |Xa — AX^| I2] and B = argmin E[||X5 — BX^Hg], 
then the partial canonical correlation between Xq and X^ is defined as 

Pe(X„Xb;X^)= max Corr(u'(X, - AX^), v'(Xb - BX^)), (2) 

that is, the partial canonical correlation between X^ and X;, is equal to the canonical correlation 
between residual vectors of X^ and X^ after the effect of variables X^ is removed (Rao 1969). 

Let ri* denote the precision matrix under the model in Eq. ([1]). Using standard results for the 
multivariate Normal distribution (Lauritzen 1996) (see also Eq. 7 in Rao (1969)), a straight forward 
calculation shows thaj^ 

Pc(Xa,Xfe;X^) / max u'S7*f,v / 0. (3) 

ueR''a,veiR*^b 

^Calculation given in Appendix iDl 
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This implies that estimating whether the partial canonical correlation is zero or not can be done 
by estimating whether a block of the precision matrix is zero or not. Furthermore, under model 
in Eq. ([T]), vectors and X;, are conditionally independent given X^ if and only if the partial 
canonical correlation is zero. In fj3l we use the above observations to provide an algorithm that 
estimates the non-zero partial canonical correlation between nodes from data Dn using the penalized 
maximum likelihood estimation of the precision matrix. 

Based on the relationship given in Eq. ([3]), we can motivate an alternative method for estimating 
the non-zero partial canonical correlation. Let a = {b : b £ [p]\{ci-}} denote the set of all nodes 
minus the node a. Then 

E[xjx^ = x^] = s;^si;-ix^. 

Since ftl^^ = - ^^S*'- ^I]±^)-i5];^^S*;^ \ we observe that a zero block flab can be iden- 

tified from the regression coefficients when each component of X^ is regressed on X^. We do not 
build an estimation procedure around this observation, however, we note that this relationship 
shows how one would develop a regression based analogue of the work presented in Katenka and 
Kolaczyk (2011). 

2.1 Related Work 

Literature on estimating multi-attribute networks is sparse. Katenka and Kolaczyk (2011) uses 
canonical correlation as an association measure between two groups of attributes to estimate the 
association network. If the canonical correlation is large enough, then a link between two nodes is 
formed. However, such an association network is known to confound the direct interactions with 
indirect ones. For example, consider a network with three nodes A, B, and C, where A only interacts 
with B and B only interacts with C. If we use marginal associations, then A and C will also be 
highly correlated due to the existence of B. Thus an edge may be wrongly put in. In contrast, our 
partial canonical correlation network correctly reveals that A and C are uncorrelated if we remove 
the effect of B. The direct interactions are thus separated from the indirect confounders. Table [J 
shows which quantities are used to construct a network. 

Our work is related to the literature on simultaneous estimation of multiple Gaussian graphical 
models under a multi-task setting (Guo, Levina, Michailidis and Zhu 2011; Varoquaux, Gramfort, 
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single attribute 


multi-attribute 


Association networks 


correlation structure 


canonical correlation 
(Katenka and Kolaczyk 2011) 


Markov networks 


partial correlation structure 


partial canonical correlation 
(this work) 



Table 1: Association networks differ from Markov networks in that edges represent different quanti- 
ties. Edge in an association network reflects that two nodes are marginally correlated or dependent. 
In a Markov network, an edge denotes a conditional dependence relationship between two nodes 
after removing the (linear) effects of other nodes. 

Poline and Thirion 2010; Honorio and Samaras 2010; Chiquet, Grandvalet and Ambroise 2011; 
Danaher, Wang and Witten 2011). But our multi-attribute network estimation problem has a 
different formulation and a different purpose from their study. In particular, it is important to 
observe that the model assumed under various multi-task settings is different from the one we 
propose in Eq. ([T|) and that the optimization algorithms developed to handle the multi-task setting 
do not extend to handle the optimization problem given in Eq. below. 

3. ESTIMATION PROCEDURE 
In this section, we provide an algorithm for estimating whether the partial canonical correlation 
between nodes is zero or not, motivated by the relationship to the precision matrix. In the first 
part, we present an efficient algorithm for minimizing the penalized negative log- likelihood, whose 
convergence is proven in the second part. Finally, we show how the method can be scaled to even 
larger problems by first identifying the connected components of the estimated graph by performing 
a simple test, without minimizing the penalized negative log- likelihood, and then estimating the 
structure of each individual component. 
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3.1 Penalized Log-Likelihood Optimization 

Based on the sample P„, we propose to minimize the penalized negative log-likelihood under the 
model in ([TJ, 

min tr SfJ - log |fJ| -h liriabll^ (4) 



where S = SjG[n] ^i^i sample covariance matrix and UriafeHj? denotes the Frobenius norm 

of Qab- The Frobenius norm penalty encourages blocks of the precision matrix to be equal to zero, 
similar to the way that the £2 penalty is used in the group Lasso (Yuan and Lin 2006). The dual 
problem to (HD is 



max 
s 



y kj + log |S| subject to max ||Sab — < A, 



(5) 



where S is the dual variable to fi. Note that the primal problem gives us an estimate of the 
precision matrix, while the dual problem estimates the covariance matrix. The proposed optimiza- 
tion procedure, described below, will estimate simultaneously the precision matrix and covariance 
matrix, without explicitly performing an expensive matrix inversion. 

We propose to optimize the objective using a block coordinate descent procedure, inspired 
by Mazumder and Agarwal (2011). The block coordinate descent is an iterative procedure that 
operates on a block of rows and columns while keeping the other rows and columns fixed. Write 

/ ^ ^ \ ( 




s 



a, a ^a,a 



and suppose that (fi, S) are the current estimates of the precision and covariance matrices. With 
the block partition above, we have log|r2| = log(f2Q:^a) -|- log(f2aa — ^a:fl'^a.;a)~^^ci,a)- The next 
iterate is of the form 



f2 = n + 



and is obtained by minimizing 



tr SaaS^aa 2 tr Sa,aS^a,a " log \^aa " ^ a^aiS^a.;^ ^f^a.al + AHfiaallF + 2A ^ | |riaf,| Ijt'. (6) 
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Complete minimization of Eq. ([6|) over the variables Qaa and fla,a at each iteration of the block 
coordinate descent can be computationally expensive. Therefore, we propose to update ftaa and 
^a,a using one generalized gradient step update in each iteration. We briefly describe generalized 
gradient step update below, while a more detailed treatment can be found in Beck and Teboulle 
(2009). Note that the objective in Eq. ([6]) is a sum of a smooth convex function and a non-smooth 
convex penalty, so that the gradient descent cannot be applied. Given a step size t, generalized 
gradient descent optimizes a quadratic approximation of the objective at the current iterate fi, 
which results in the following two updates 

fiaa = argmintr(Saa - Saa)f^aa + ^ll^^aa - S^aallF + A||riaa||F, and (7) 
^ab = aigmintliSab - ^ab)^ba + -^\\^ab - ^abllp + M\^ab\\F, V6 G a. (8) 

Solutions to Eq. d?]) and Eq. ([8|) can be found in a closed form as 

aa ~ Saa)\\F) + {^aa + tCEaa - Saa)), and (9) 
nab = {l-tX/\\hab + t{i:ab-Sab)\\F)+{nab + t{±ab-Sab)), V6 G a, (10) 

where (x)-|- = max(0, x). If the resulting estimator fl is not positive definite or the update does not 
decrease the objective, we half the step size t and find new update. Once the update of the precision 
matrix, ft, is found, we need to update the covariance matrix. This can be done efficiently, without 
inverting the whole Ct matrix, using the matrix inversion lemma as follows 

^a,a ~ (^o,a) ~l" (^a,a) ^a,a{^aa ^a,a{^a,a) ^a,a) ^a,a{^a,a) 
^a,a = ~^aa^a,a^a,a (^^^ 
^aa — i^aa ^a,ai^a,a) ^a,a) 

with {fta,a)~^ = 5]a,a " ^a,a'^aa'^a,a- Combining all the steps we arrive at Algorithm [TJ Finally, 
we form a network G = (V, E) by connecting nodes with ||riab||ir / 0. 

3.2 Convergence of Algorithm [1] 

In this section we will analyze the convergence properties of Algorithm [U detailed in the previous 
section. In particular, we show that Algorithm [T] produces iterates that converge to the unique 
minimum of the objective in Eq. Q. Note that Algorithm [1] is different from the conventional block 
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coordinate descent, described in Tseng (2001), for two reasons. First, the minimization problem in 
Eq. ^ is not solved to convergence at each iteration. Recall that we only update ^irid S^^^a 
using one generalized gradient step update in each iteration. Second, blocks of variables, over which 
the optimization is done at each iteration, are not completely separable between iterations due to 
the symmetry of the problem. 

We start by providing a sequence of lemmas that characterize the optimal solution fl to Eq. @ 
and will be used to prove the convergence of Algorithm [TJ The following lemma states that the 
minimizer of Eq. is unique and has bounded minimum and maximum eigenvalues. 

Lemma 1. For every value o/A > 0, the optimization problem in Eq. @ has a unique minimizer 
fl, which satisfies Amin(f^) > (Amax(S) + Xpy^ > and Amax(f^) < Y.je[p] 

The next results states that the objective function has a Lipschitz continuous gradient, which 
will be used to show that the generalized gradient descent can be used to find fi. 

Lemma 2. The function /(A) = trSA — log|A| has a Lipschitz continuous gradient on the set 
{A G : Ajnin(A) > 7}, with the Lipschitz constant L = 7"^. 

Finally, we have a result stating the convergence of Algorithm [TJ 

Lemma 3. For every value of X > 0, AlgorithmUl produces a sequence of estimates (r2'^*^)(>i of 
the precision matrix that monotonically decrease the objective value given in Eq. are positive 
definite and converge to the unique minimizer ft of Eq. (jj]) . 

Proofs of these results are given in the Appendix lAl 
3.3 Efficient Identification of Connected Components 

In this section, we present a way to speed up Algorithm [1] when the target graph is composed 
of many smaller, disconnected components. We present necessary and sufficient condition for the 
solution n of Eq. ^ to be block-diagonal, potentially after permuting the node indices. The 
condition can be easily checked by inspecting the empirical covariance matrix S and allows us 
to achieve massive computational gains when minimizing Eq. @ by considering only one block 
of nodes at a time. Note that once the connected components are identified, the optimization 
problem in Eq. needs to be solved for a much smaller network. The equivalent condition in the 
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Algorithm 1 Minimization procedure for the objective 
Require: Empirical covariance matrix S, penalty parameter A 

1: Set the initial estimator ft = diag(S) and S) = ri~^ 

2: Set the step size t = 1 

3: repeat 

4: for a G [p] do 

5: Update fl using using ([9]) and (fTO|) 

6: Compute S using ([TT]) 

7: if ri is not positive definite then 

8: t = t/2 

9: Go back to line [5] 

10: end if 
11: Set (fi,S) ^ 
12: end for 

13: until convergence criterion is met (e.g., duality gap < e) 

14: return Estimates {fl, Xl) of the precision and covariance matrices 

case of Guassian graphical models have been described in Witten, Friedman and Simon (2011) and 
Mazumder and Hastie (2011). 

Our first result follows immediately from the KKT conditions for the optimization problem ([4]) 
and states that if CI is block-diagonal, then it can be obtained by solving a sequence of smaller 
optimization problems. 

Lemma 4. If the solution to Eq. (jH takes the form CI = diag(f2i, 4^2, ■ • • , fij); then it can he 
obtained by solving 

min trS/'fi/' — log iri^/l + A > Ili^abllF 

a,b 

separately for each I' = 1, . . . , where S;/ are suhmatrices of S corresponding to fi//. 

Next, we describe how to identify diagonal blocks of Cl. Let V = {Pi, P2, • • • , Pi} be a partition 
of the set [p] and assume that the nodes of the graph are ordered in a way that if a G Pj, 6 G Pji, 
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j < j, then a < b. The following lemma states that the blocks of il. can be obtained from the blocks 
of a thresholded sample covariance matrix. 

Lemma 5. A necessary and sufficient conditions for Q to be block diagonal with blocks Pi, P2, . . . , Pi 
is that ||Sab||F < A for all a & Pj, b £ Pji , j / j' . 

Blocks Pi , i-*2 , ■ ■ ■ , -P; can be identified by forming a p x p matrix Q with elements qab = 
IIIISafellF > A} and computing connected components of the graph with adjacency matrix Q. 
The lemma states also that given two penalty parameters Ai,A2, Ai < A2 the set of unconnected 
nodes with penalty parameter Ai is a subset of unconnected nodes with penalty parameter A2- The 
simple check above allows us to estimate networks on datasets with large number of nodes, if we 
are interested in networks with small number of edges. However, this is often the case when the 
networks are used for exploration and interpretation of complex systems. 

4. THEORETICAL RESULTS 
In this section, we provide theoretical analysis of the estimator described in ^ In particular, we 
provide sufficient conditions for the consistent graph structure recovery under the assumption that, 
for eacljfl a = 1, . . . , hp, (cr*^)"^/^^^ is a sub-Gaussian with parameter 7, where a*^ is a diagonal 
element of S*. Recall that Z is a sub-Gaussian random variable if there exists a constant a G (0, 00) 
such that 

E[exp(tZ)] < exp{a'^t^), for all t G M. 

A statement of a general result is given in the appendix, together with proofs. 

Our assumptions involve the Hessian of the function /(A) = tr SA — log |A| evaluated at the 
true ft*, n = n{ft*) = (S) {fl*y^ G ^ipk)^x{pk)^^ ^^^^ ^^^^ ^^^^ covariance matrix S*. The 

Hessian and the covariance matrix can be thought of as block matrices with blocks of size fc^ x k'^ and 
kx k, respectively. We will make use of the operator C(-) that operates on these block matrices and 
outputs a smaller matrix with elements that equal to the Probenius norm of the original blocks. 
For example, C(S;*) G Rp^p with elements C{'E*)ab = ll^afelli"- Let T = {{a,b) : HfiabllF / 0} 
and M = {{a,b) : \\flab\\F = 0}. With this notation introduced, we assume that the following 
■^For simplicity of presentation, we assume that ka = k, for all a G [p], that is, we assume that the same number 
of attributes is observed for each node. 
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irrepresentable condition holds. 

Assumption: There exists a constant a G [0, 1) such that 

|||C {nMrinrrr') llloo < l - a. (12) 

We will also need the following quantities to specify the results ks* = |||C(5]*)|||oo and = 
|||C('H^:^)|||oo- These conditions extend the conditions specified in Ravikumar et al. (2011) needed 
for estimation of networks from single attribute observations. 

We have the following result that provides sufficient conditions for recovery of the graph struc- 
ture. 

Theorem 6. Set the penalty parameter A in Eq. (j3D as 



A = 8ka~^ ^128(1 + 472)2 (max(o-*„)2)n-i (2 log(2fe) + r log(p)) , 

where r > 2. // 

n > Cis^A;^(l + 8a~^)^(rlogp + log4 + 21ogfc) 
where s is the maximal degree of nodes in G, 

Ci = (48\/2(l + 472)(maxo-*„)max(Kx;.K^,K|;.K|,))2 

a 

and 



no II ^ iR ^^^^ 1 /I 2n/ * Nn , o i, rr\ogp + \ogA + 2\ogk 
mm ||S2afe||i7 > 16v 2(1 + 47 )(maxcra(j)(l + 8a )Knk\ 

[a,b)£T,a^b a \ n 



then AlgorithmUl estimates a graph G which satisfies ¥[G = G]> 1 — p^ . 

5. INTERPRETING EDGES 
In § O we have developed an algorithm for inferring the graph structure from multi-attribute nodal 
observations. The algorithm is based on estimating, simultaneously for all pairs of nodes, whether 
partial canonical correlation between two nodes is zero or not. Under suitable assumptions, this 
algorithm reliably recovers the network structure. In this section, we propose a post-processing 
step that will allow us to quantify the strength of links between nodes, as well as identify attributes 
that contribute to existence of links. 
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For any two nodes a and 6, for which Q,ab 7^ 0, define A/'(a, 6) = {c G b]\{«, f*} : ^ac 7^ OVfiftc / 
0}, the Markov blanket for the set of nodes {Xa,X;,}. Note that the conditional distribution of 
(X^,X'^)' given X^ is equal to the conditional distribution of (Xjj,X'jj)' given X_^((j (,). Now, 

Pc(Xa,X;,;X^) = /)c(Xa, Xfe; X^(„ fe)) 

= max Corr (u (X^ - AX^(„ m ) , v' (X^ - BX_^(a,6) ) ) , 

where A = argmin E[||Xa — AX_^(q ;,)|||] and B = argmin E[||Xf, — BX_/\/(„ ;,)|||]. Define Xl(a, h) = 
Var(Xa,Xb I X_^(a (,)). Now we can express the partial canonical correlation as 
,-v V V ^ w^SafeWh 

Pc{-^a-,-^b;-^M{a,b)) = pax ; - , = 

where 



S(a,6) 



y ^ba ^bb J 

The weight vectors Wq and can be easily found by solving the system of eigenvalue equations 

^ 1 1 

^aa Safe^fefc ^ba^^a = X'^Wa, ^^^^ 
'^bb '^ba'^aa ^afeWfo = X^Wb, 

with Wa and Wf, being the vectors that correspond to the maximum eigenvalue A^. Furthermore, 
we have pc(Xa, X;,; X^(a,fe)) = X. Following (Katenka and Kolaczyk 2011), the weights Wq, Wb can 
be used to access the relative contribution of each attribute to the edge between the nodes a and 
b. In particular, the weight (wl^)'^ characterizes the relative contribution of attribute i of node a to 

Pc0^a,^b',^Af(a,b)) ■ 

Algorithm [T] provides an estimate J\f{a,b) = {c G [p]\{a, 6} : Q,ac / V Q,hc / 0} of Af{a,b). 
Given an estimate of the Markov blanket, we form the residual vectors 

ri,a = ^i,a - AX. and ri^b = Xi,b - BX.^^^^ 

where A and B are the least square estimators of A and B. Given the residuals, we form S(a, 5), 
the empirical version of the matrix S(a, 6), by setting 

T.aa = Corr({ri_a}ig[„]), Ebb = Corr({ri,b}ig[„]), and T^ab = CoTy:{{ri^a}ie[n], {T^i,a}ie[n])- 

Now, solving the eigenvalue system in Eq. (I13p will give us estimated of the vectors Wq, Wf, and 
the partial canonical correlation. 
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Figure 1: Average hamming distance plotted against the rescaled sample size. The top row shows 
results for the glasso estimator run on each individual attribute separately and then merging the 
resulting networks. The bottom row shows the network estimated with Algorithm [T] Each column 
represents one simulation setting. Results are averaged over 100 independent runs. 

6. SIMULATION STUDIES 
In this section, we perform a set of simulation studies to illustrate finite sample performance of our 
procedure. We demonstrate that the scalings predicted by the theory are sharp. Furthermore, we 
compare against a procedure that uses the glasso first to estimate one network over each of the 
k individual attributes and then creates an edge in the resulting network if the edge appears in 
at least one of the single attribute networks. We have also tried applying the glasso to estimate 
the precision matrix for the model in ([T|) and then post-processing it, so that an edge appears in 
the resulting network if the corresponding block of the estimated precision matrix is non-zero. The 
results were worse compared to the first baseline, so we do not report them here. 

Theoretical results given in Section S] predict the sample size needed for consistent recovery 
of the underlying graph. In particular. Lemma E] suggests that we need 0(s^fc^ log(pfc)) samples 
to estimate the graph structure consistently. Therefore, if we plot the hamming distance between 
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the true and recovered graph structure against appropriately rescaled sample size, we expect the 
curves to reach zero distance for different problem sizes at a same point. We verify this on randomly 
generated chain and nearest-neighbors graphs. 

We generate data as follows. A random graph with p nodes is created by first partitioning nodes 
into p/20 connected components, each with 20 nodes, and then forming a random graph over these 
20 nodes. A chain graph is formed from by permuting the nodes and connecting them in succession, 
while a nearest-neighbor graph is constructed following the procedure outlined in Li and Gui (2006). 
That is, for each node, we draw a point uniformly at random on a unit square and compute the 
pairwise distances between nodes. Each node is then connected to s = 4 closest neighbors. Since 
some of nodes will have more than 4 adjacent edges, we remove randomly edges from nodes that 
have degree larger than 4 until the maximum degree of a node in a network is 4. Once the graph 
structure is created, we construct a precision matrix, with non-zero blocks corresponding to edges 
in the graph. Diagonal blocks are constructed as O.S'""''!, < a,6 < A;, while off-diagonal blocks 
have elements with the same value, 0.2 for chain graphs and 0.3/ k for nearest-neighbor networks. 
Finally, we add pi to the precision matrix, so that its minimum eigenvalue is equal to 0.5. Note 
that s = 2 for the chain graph and s = 4 for the nearest-neighbor graph. Simulation results are 
averaged over 100 independent runs. 

Figure [1] shows results of the simulations. The top row reports results for the glasso procedure, 
while the results in the bottom row are obtained by Algorithm [TJ Each column in the figure 
represents a different simulation setting. For the first two columns, we set k = 3 and vary the 
total number of nodes in the graph p. The third simulation setting sets the total number of nodes 
p = 20 and changes the number of attributes k. In all the simulation settings, we observe that joint 
network estimation does better. Furthermore, we note that even for the chain graph, the estimation 
over individual nodes does not recover the graph correctly. This suggests that the conditions for 
the consistent graph recovery may be weaker in the multi-attribute setting. 
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protein network 


gene network 


gene / protein network 


Number of edges 


122 


214 


249 


Density 


0.03 


0.05 


0.06 


Largest connected component 


62 


89 


82 


Avg Node Degree 


2.68 


4.70 


5.47 


Avg Clustering Coefficient 


0.0008 


0.001 


0.003 



Table 2: Summary statistics for protein, gene, and gene/protein networks {p = 91). 

7. ANALYSIS OF A GENE/PROTEIN REGULATORY NETWORK 
We provide illustrative, exploratory analysis of data from the well-known NCI-60 database, which 
contains different molecular profiles on a panel of 60 diverse human cancer cell lineqj- Data set 
consists of protein profiles (normalized reverse-phase lysate arrays (RPLA) for 92 antibodies) and 
gene profiles (normalized RNA microarray intensities from Human Genome U95 Affymetrix chip- 
set for > 9000 genes). We focus our analysis on a subset of 91 genes/proteins for which both types 
of profiles are available. These profiles are available across the same set of 60 cancer cells. More 
detailed description of the data set can be found in Katenka and Kolaczyk (2011). 

We inferred three types of networks: a network based on protein measurements alone, a network 
based on gene expression profiles and a single gene/protein network. For protein and gene networks 
we use the glasso, while for the gene/protein network, we use Algorithm [TJ We use the stability 



protein network 



gene network 



gene/protein network 




10 15 20 




10 15 20 




10 15 20 



node degree 



node degree 



node degree 



Figure 2: Node degree distributions for protein, gene and gene/protein networks. 
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selection (Meinshausen and Biihlmann 2010) procedure to estimate stable networks. In particular, 
we first select the penalty parameter A using cross-validation, which over-selects the number of 
edges in a network. Next, we use the selected A to estimate 100 networks based on random 
subsamples containing 80% of the data-points. Final network is composed of stable edges that 
appear in at least 95 of the estimated networks. Table [2] provides a few summary statistics for the 
estimated networks. Furthermore, protein and gene/protein networks share 96 edges, while gene 
and gene/protein networks share 104 edges. Gene and protein network share only 17 edges. Finally, 
66 edges are unique to gene/protein network. Figure [2] shows node degree distributions for the three 
networks. We observe that the estimated networks are much sparser than the association networks 
in Katenka and Kolaczyk (2011), as expected due to marginal correlations between a number of 
nodes. The differences in networks require a closer biological inspection by a domain scientist. 

We proceed with a further exploratory analysis of the gene/protein network. We investigate 
the contribution of two nodal attributes to the existence of an edges between the nodes. Following 
(Katenka and Kolaczyk 2011), we use a simple heuristic based on the weight vectors to classify the 
nodes and edges into three classes. For an edge between the nodes a and b, we take one weight 
vector, say Wq, and normalize it to have unit norm. Denote Wp the component corresponding to 
the protein attribute. Left plot in Figure [3] shows the values of Wp over all edges. The edges can 
be classified into three classes based on the value of Wp. Given a threshold T, the edges for which 
Wp G (0, T) are classified as gene-influenced, the edges for which Wp £ {1 — T, 1) are classified as 
protein influenced, while the remainder of the edges are classified as mixed type. In the left plot of 
Figure El the threshold is set as T = 0.25. Similar classification can be performed for nodes after 
computing the proportion of incident edges. Let pi, p2 and denote proportions of gene, protein 
and mixed edges, respectively, incident with a node. These proportions are represented in a simplex 
in the right subplot of Figure [3l Nodes with mostly gene edges are located in the lower left corner, 
while the nodes with mostly protein edges are located in the lower right corner. Mixed nodes are 
located in the center and towards the top corner of the simplex. Further biological enrichment 
analysis is possible (see Katenka and Kolaczyk (2011)), however, we do not pursue this here. 
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Figure 3: Left subplot represents the value Wp over all edges in the gene/protein network. Based on 
the value Wp and the threshold T = 0.25, edges can be classified into three classes. Region 1 denotes 
edges that are included in the graph primarily due to the influence of gene attribute, while region 
3 collects edges that are primarily influenced by the protein attribute. Region 2 contains edges of 
mixed type. Right subplot represent nodes in the network in a simplex based on the proportion of 
edges incident to them. Proportions pi, p2 and ps denote proportions of gene, mixed and protein 
edges, respectively, incident to a node. Three regions are indicated with the dashed lines, with the 
region 1 and 3 containing gene and protein nodes, respectively. Nodes in the region 2 are of mixed 
type. 

8. DISCUSSION AND EXTENSIONS 
In this paper, we have proposed a solution to the problem of learning networks from multivariate 
nodal attributes, which arises in a variety of domains. Our method maximizes the penalized 
likelihood under the model in Eq. ([T|), which simultaneously estimates for all partial canonical 
correlation coefficients whether they are zero or not. When all the attributes across all the nodes 
follow joint multivariate Normal distribution, our procedure is equivalent to estimating conditional 
independencies between nodes, which is revealed by relating the blocks of the precision matrix 
to partial canonical correlation. Although a penalized likelihood framework is adopted in the 
current paper for estimation of the non-zero blocks of the precision matrix, other approaches like 
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neighborhood pursuit or greedy pursuit can also be developed. Thorough numerical evaluations and 
theoretical analysis of these methods is an interesting direction for future work. Another interesting 
direction is to explore the semiparametric and nonparametric extensions of our method to more 
general classes of networks. 
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APPENDIX A. PROOFS OF TECHNICAL RESULTS 
A.l Proof of Lemma [T] 

The optimization objective given in Eq. can be written in the equivalent constrained form as 



trSfJ - log |ri| subject to "^WftabWr < C{X). 



mm 

a,b 

The procedure involves minimizing a continuous objective over a compact set, and so by Weierstrass 
theorem, the minimum is always achieved. Furthermore, the objective is strongly convex and 
therefore the minimum is unique. 

The solution to the optimization problem (jlj) satisfies 

S-n-^ + AZ = (A.l) 

where Z e f^X^a b H^af*!!^ element of the sub-differential and satisfies ||Zaft||i? < 1 for all 

{a,h) G [p]^. Therefore, 

Amax(J^"^) < Amax(S) + AAi^ax(Z) < Amax(S) + Ap. 

Next, we prove an upper bound on Ainax(^^)- At optimum, the primal-dual gap is zero, which 
gives that 

^W^abWr < A"^(X] kj-tiSh) < A~^ kj, 

a,b je[p] je[p] 

as S ^ and ;^ 0. Since Ajnax(^^) < T^ab ll^afellF) the proof is done. 

A. 2 Proof of Lemma [2] 

We have that V/(A) = S - A~^. Then 

||V/(A) - V/(AO||f = IIA-i - (A')-1||f 

<A A"-*-!! A — A'll irA A"-*- 

<7-2||A- A'lli.. 

A. 3 Proof of Lemma [3] 

By construction, the sequence of estimates (ri'^*))t>i decrease the objective value and are positive 
definite. 
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To prove the convergence, we first introduce some additional notation. Let /(fi) = trSfi — 
log|n| and F{n) = /{fl) + ^^hW^abWr- For any L > 0, let 

QL{ft;n) := f{n) + tT[{n-n)vfin)] + ^\\n-n\\l + Y,\\^ab\\F 

ab 

be a quadratic approximation of F{Q) at a given point fl, which has a unique minimizer 

Pl(S^) := argnnnQL(f2; ri). 
From Lemma 2.3. in Beck and Teboulle (2009), we have that 

Fm - FiPLim > ^\\PLm - mi (a.2) 

if F{pLiTi)) < QL{pLiTl);Ti). Note that F(pL(n)) < QL(pL(n); iT) always holds if L is as large as 
the Lipschitz constant of V-F. 

Let ri^*"^) and Jl^*^ denote two successive iterates obtained by Algorithm [TJ Without loss of 
generality, we can assume that fi^*) is obtained by updating the rows/columns corresponding to 
the node a. From Eq. (|A.2p . it follows that 

_ > iinii-^) - m\F +2^.0:;'^ - fiSib (A.3) 

where L^. is a current estimate of the Lipschitz constant. Recall that in Algorithm [1] the scalar t 
serves as a local approximation of 1/L. Since eigenvalues of CI are bounded according to Lemmadl 
we can conclude that the eigenvalues of fi*-*"^^ are bounded as well. Therefore the current Lipschitz 
constant is bounded away from zero, using Lemma [2j Combining the results, we observe that the 
right hand side of Eq. (jA.SP converges to zero as i — )• oo, since the optimization procedure produces 
iterates that decrease the objective value. This shows that \\ftaa — Ctal\\p + 2^^_^^ ll^il ~ 
^iftll-P' converges to zero, for any a £ [p]. Since (fi^*) is a bounded sequence, it has a limit point, 
which we denote fl. It is easy to see, from the stationary conditions for the optimization problem 
given in Eq. that the limit point fi also satisfies the global KKT conditions to the optimization 
problem in Eq. (jj]). 

A. 4 Proof of Lemma [5] 

Suppose that the solution to Eq. ^ is block diagonal with blocks Pi, P2, ■ ■ ■ , Pi- For two nodes 
a, b in different blocks, we have that = as the inverse of the block diagonal matrix is block 
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diagonal. From the KKT conditions, it follows that ||S(jb||i? < A. 

Now suppose that ||Sab||F < A for all a £ Pj,b £ Pj',j 7^ j' ■ For every /' = !,...,/ construct 

n^/ = arg min tr S^/fi;/ — log iri/'l + A > 1 1 flab 1 1 j?. 

a,b 

Then fl = diag(rii, CI2, ■ ■ ■ , ^i) is the solution of Eq. @ as it satisfies the KKT conditions. 

APPENDIX B. NETWORK ESTIMATION CONSISTENCY 
In this appendix, we provide sufficient conditions for consistent network estimation using Algo- 
rithm [TJ Lemma E] given in 2] is then a simple consequence. To provide sufficient conditions, we 
extend the work of Ravikumar et al. (2011) to our setting, where we observe multiple attributes 
for each node. In particular, we extend their Theorem 1. 

For simplicity of presentation, we assume that ka = k, for all a G [p], that is, we assume that 
the same number of attributes is observed for each node. Our assumptions involve the Hessian of 
the function /(A) = tr SA — log |A| evaluated at the true ft*, 

>*^-l ^ |^(pfc)2x(pfc)2 



n = n{ft*) = (ft*) 



(n* 



(A.4) 



and the true covariance matrix 5]*. The Hessian and the covariance matrix can be thought of block 
matrices with blocks of size k"^ x k'^ and k x k, respectively. We will make use of the operator C(-) 
that operates on these block matrices and outputs a smaller matrix with elements that equal to 
the Frobenius norm of the original blocks, 



All A12 
A21 A22 



^2p 



c{-) 



|Aii||f 

|A2i||f 



I A12IIF 
I A22IIF 



lAlplliT' 

|A2p||f 



\ IIApill^ 



y Api • • • App J 

In particular, C(5]*) G and C{n) G IRp'^p". 

We denote the index set of the non-zero blocks of the precision matrix as 

T ■.= {ia,b) eV xV : \\n*J\2^0}U{ia,a) : a e V} 

and let Af denote its complement inVxV, that is, 

M={ia,b) : \\nab\\F = 0}. 



^ppWf J 
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As mentioned earlier, we need to make an assumption on the Hessian matrix, which takes the 
standard irrepresentable-like form. 

Assumption: There exists a constant a € [0, 1) such that 

|||C (HArr(Hrr)"') loo < 1 - a. (A.5) 

These condition extends the irrepresentable condition given in Ravikumar et al. (2011), which was 
needed for estimation of networks from single attribute observations. It is worth noting, that the 
condition given in Eq. (jA.5p can be much weaker than the irrepresentable condition of Ravikumar 
et al. (2011) applied directly to the full Hessian matrix. This can be observed in simulations done 
in ^ where a chain network is not consistently estimated even with a large number of samples. 
We will also need the following two quantities to specify the results 

= |||C(5:*)|||oo (A.6) 

and 

Kn = |||C(?^^V)llloo- (A.7) 

Finally, the results are going to depend on the tail bounds for the elements of the matrix 
C(S— S*). We will assume that there is a constant v^, G (0, oo] and a function / : Nx (0, oo) i— ?• (0, oo) 
such that for any {a,b) £ V x V 

P[C(S - S*),fc > 5] < — i- 5e{0,v-']. (A.8) 
f{n,d) 

The function /(n, S) will be monotonically increasing in both n and 5. Therefore, we define the 
following two inverse functions 

nf{S;r) = argmaxjn : f{n,6) < r} (A. 9) 

and 

6f{r;n) = argmax{5 : f{n,d) < r} (A. 10) 

for r G [1, oo). 

With the notation introduced, we have the following result. 
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Theorem 7. Assume that the irrepresentable condition in Eq. (jA.SP is satisfied and that there exists 
a constant v^, G (0, oo] and a function f{n, 6) so that Eq. (jA.Sp is satisfied for any (a, b) £ V x V. 
Let 



for some r > 2. // 



A = -Sf{n,p'') 
a 



n > Uf [ — o — 9-TT>P^ (A-ll) 

\max(f,,, 6(1 + 8a max(«;x;*K-^, Kj^. k;^)) 



then 

\\C{n - n)||oo < 2(1 + 8a~^)Kn6fin,p^) (A.12) 
with probability at least 1 — p'^~'^ . 

Theorem [7] is of the same form as Theorem 1 in Ravikumar et al. (2011), but the l^o element- 
wise convergence is estabhshed for C(ri — fi), which wih guarantee successful recovery of non-zero 
partial canonical correlations if the blocks of the true precision matrix are sufficiently large. 

Theorem [7] is proven as Theorem 1 in Ravikumar et al. (2011). We provide technical results in 
Lemma [HI Lemma [9] and Lemma [T0| which can be used to substitute results of Lemma 4, Lemma 5 
and Lemma 6 in Ravikumar et al. (2011) under our setting. The rest of the arguments then go 
through. Below we provide some more details. 

First, let Z : W'^^'p'^ -^pkxpk ^^le mapping defined as 



WW if||A.,||p/0, 



Z with llZlliT' < 1 if \\Aab\\F = 0, 



(A.13) 



Next, define the function 

G(n) = trfiS - log|n| + A||C(n)||i, yflyO (A.14) 
and the following system of equations 



(A.15) 



Sab-{^-^)ab = -X2ifl)ab, if / 

\\Sab-i^-^)ab\\F < A, ifnafe = 0. 

It is known that fi G MP^^ is the minimizer of optimization problem in Eq. @ if and only if it 
satisfies the system of equations given in Eq. (jA.lSp . We have already shown in Lemma [T] that the 
minimizer is unique. 
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Let ri be the solution to the following constrained optimization problem 

mm trSn - log |n|+A||C(n)||i subject to C(n)afe = 0, \/{a,b)(^N. (A.16) 

Observe that one cannot find in practice, as it depends on the unknown set N . However, it is 
a useful construction in the proof. We will prove that f2 is solution to the optimization problem 
given in Eq. Q, that is, we will show that satisfies the system of equations (jA.lSp . 
Using the first-order Taylor expansion we have that 

f^-i = - A(n*)-i + i2(A), (A.17) 

where A = f2 — il* and i?(A) denotes the remainder term. With this, we state and prove Lemma[8l 
Lemma [9] and Lemma [TOl They can be combined as in Ravikumar et al. (2011) to complete the 
proof of Theorem [71 

Lemma 8. Assume that 

max||Aa6||F < — and max - Sab||F < — . (A.18) 

ah o ab o 

Then is the solution to the optimization problem in Eq. 

Proof. We use R to denote R(A). Recall that A_/v- = by construction. Using (|A.17p we can 
rewrite (IA.15P as 

?^ab,rAr-Rab + Sa6-S*6 + AZ(nU = if(a,6)Gr (A.19) 

||Hab,rAr-Rab + S„fc-S*bll2 < A if(a,6)GAA. (A.20) 

By construction, the solution satisfy ()A.19p . Under the assumptions, we show that ()A.20p is also 
satisfied with inequality. 

From (jA.lOp . we can solve for A7-, 

Ar = H:^^^[Rr - Sr + Sr - A^(S7)r]. 

Then 

\\'Hab,T'Hr]T^'T ~ '^T + Sr - ^2:{Q,)r] - Rafe + Sab - ^afclb 

Sab — '^ab\\2 

<A(l-a) + (2-a)^ 
< A 
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using assumption on n in (1X5]) and ([Al8]l . This shows that Q, satisfies (|Al5]) . 
Lemma 9. Assume that 

l|C(A)|U< ^ 



T/ien 

||C(R(A))|U<:^4.||C(A)||L. 
Proof. Remainder term can be written as 

R(A) = {n* + A)-i - + (n*)-iA(rj*)-^ 

Using (lA.27p . we have that 

|||C((n*)-^A)||U < |||C((n*)-i)|||oo|||C(A)||U 

<s|||C((n*)-i)||U||C(A)|U 
1 

< - 
- 3 

which gives us the following expansion 

{n* + A)-i = - (n*)"^A(rj*)-i + (rj*)-iA(n*)-iAj(rj*)- 

with J = Y.k>oi-'^f ■ Using (fA:28]l and dA^T]), we have that 

||C(R)|U < ||C((n*)-iA)|U|||C((n*)-iAJ(n*)-i)'||U 
<|||C((n*)-i)|||L||C(A)|U|||C(J')|||oo|||C(A)||U 

<s|||c((n*)-i)|||Ll|c(A)||Lll|c(j')|||oc. 



Next, we have that 



iiic(J0iiioo<5]r(A(n*)-i)iii^ 

fc>0 

1 



< 



1 - iiic(A(n*)-i: 

3 



which gives us 



as claimed. 



< 
- 2' 



3s 

||C(R)||oo<y4*l|C(A)||^ 
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Lemma 10. Assume that 

r := 2k«(||C(S - + A) < min f , \ ) ■ (A.23) 

Then 

||C(A)|U<r. (A.24) 

Proof. The proof follows the proof of Lemma 6 in Ravikumar, Wainwright, Raskutti and Yu (2008). 
Define the ball 

B{r):={A : C(AU < r, V(a, 6) G T}, 



the gradient mapping 



and 



F(Ar) = -n:^\G{n*^ + At) + Ar- 

We need to show that F{B{r)) C B{r), which implies that ||C(A7-)||oo < r. 

Under the assumptions of the lemma, for any A5 G B{r), we have the following decomposition 

F(Ar) = 7^:^^R(A)r + ^rr(Sr - + A^("* + ^)r)- 
Using Lemma [9l the first term can be bounded as 

||C(^^VR(A)r)||oo < r(?^^V)lll°ollC(R(A)||oo 
<-^H4:A\cm\lo 

< r/2 

where the last inequality follows under the assumptions. Similarly 

\\e{n^\{ST - + + A)r)||oc 

< |||C(^:^^)||U(||C(S - s*)|U + \\\c{z{n* + A))|U) 

<K^(||C(S-S*)|U + A) 

< r/2. 

This shows that F(S(r)) C B{r). □ 
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The following result is a corollary of Theorem [71 which shows that the graph structure can be 
estimated consistently under some assumptions. 

Corollary 11. Assume that the conditions of Theorem^ are satisfied. Furthermore, suppose that 

min \\Q\\f > 2{1 + 8a~^)K,'H^f{n,p'^) 

{a,b)£T, a^b 

then Algorithm 1 estimates a graph G which satisfies 

P[G/ G] > l-p^-\ 

Next, we specialize the result of Theorem [7] to a case where X has sub-Gaussian tails. That 
is, the random vector X = (Xi, . . . ,Xpf^y is zero-mean with covariance I]*. Each (cr*„)~^/^Xa is 
sub-Gaussian with parameter 7. 

Lemma 12. Set the penalty parameter in A in Eq. @ as 



A = 8A;a"^y^l28(l + 472)2(max(o-*„)2)n-i(21og(2A;) +rlog(p)). 

// 

n > Cis2A;^(1 + 8a~^)^(rlogp + log4 + 21og/c) 
where Ci = (48\/2(l + 472)(maxa (T*„) max(Ks*K'H, '^l;.'^!^))^ the 



len 



\\m - ")lloo < 16V2(1 + 47^) max.Sd + 8a-)K«fc J^^^^^^^±i^^^i±^ 

i V n 

with probability 1 — p'^~'^ . 

The proof simply follows by observing that, for any (a, b), 

P[C(S - >S]< n max (a,, - a*,f > 5^/^] 

{c,d)e{a,b) 

<k^¥[\a,d-c7*J>6/k] (A.25) 
< 4A:^ exp 



for all 6 G (0, 8(1 + 472)(maXa <t*„)) with = 128(1 + 4j'^)^{maXa{a*^f). Therefore, 

1 n6'^ 

n,(5;r) = ^-^ 
6f{r;n) = 



k'^ \og{4:k'^r) 
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Theorem [7] and some simple algebra complete the proof. 
Lemma E] is a simple conseqeuence of Lemma \TZ[ 

APPENDIX C. SOME RESULTS ON NORMS OF BLOCK MATRICES 
Let T be a partition of [p]. Throughout this section, we assume that matrices A, B G MP^^ and a 
vector b G are partitioned into blocks according to T- 



Lemma 13. 



Proof. For any a £ T, 



max II Aa.bl k < max \ \ | A^fel |f max | Ibdh- 



|Aa.b||2 < ^ ||Aafebb||2 



beT V iea 



<EJEiiA..iiiiib, 

beT V «6« 



1,2 
b\\2 



- EjE 11^*^112 lib':! I2 



beT V iGa 



EIIAatlliT'maxllb 



b£T 



■c 2- 



(A.26) 



□ 



Lemma 14. 



|||C(AB)||U < |||C(B)||U|||C(A) 



(A.27) 



Proof. Let C = AB and let 7" be a partition of [p]. 



IICfAB) 



00 = max 



ab\\F 



b€T 



< maxV] ||Aac||F||Bcb||F 
aer ^ 

b c 

< {maxV] IIAacllFlimaxy] ||Bc;,||f} 

aer ^ cer ^ 

= |||C(A)||Ur(B)||U. 



□ 
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Lemma 15. 

||C(AB)||oo < ||C(A)|U|||C(By||U. (A.28) 

Proof. For a fixed a and b, 

C(AB)ab = II ^ AacBcblli? 
c 

< ^||A„c||f||Bc6||f 

c 

< max||Aac|| ||Bc6||f- 

c 

Maximizing over a and 6 gives the result. □ 

APPENDIX D. PROOF OF EQ. 3 

First, we note that 



Var (X;,X; 



- ^ab,ab - ^afe,Z6^^^^^^,, 



is the conditional covariance matrix of (Xjj,X^)' given the remaining nodes X^ (see Proposition 
C.5 in Lauritzen (1996)). Define S = 'Sab,ab — '^abah^^'i^^ab ab' P^-i^tial canonical correlation 
between Xq and X^ is equal to zero if and only if S^fc = 0- On the other hand, the matrix inversion 
lemma gives that ^ab,ab = ^ ^ • Now, Ctab = if and only if S^f, = 0. This shows the equivalence 
relationship in Eq. ([3]). 
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